Conversation

@pkoutsovasilis (Contributor) commented Oct 17, 2025

What does this PR do?

This PR fixes zombie/defunct processes that are left behind when Elastic Agent re-executes itself during restart. The fix involves:

  1. Decreasing the EDOT collector shutdown timeout from 30 seconds to 3 seconds so that it completes within the coordinator's default 5-second shutdown timeout
    • Adding a safety net that waits one additional second after killing a process to ensure Wait() is called and the process is reaped (see the sketch after this list)
  2. Improving graceful shutdown handling in the EDOT collector subprocess manager to ensure proper process cleanup
  3. Adding debug logging throughout the shutdown process to better trace subprocess termination
  4. Adding an integration test that verifies no zombie processes are left behind after agent restart
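
For illustration, here is a minimal sketch of the kill-then-reap pattern described above; the names (stopSubprocess, collectorShutdownTimeout, killWaitGrace, requestStop) are hypothetical and not the agent's actual subprocess manager API:

package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

const (
	// collectorShutdownTimeout must stay below the coordinator's 5-second shutdown timeout.
	collectorShutdownTimeout = 3 * time.Second
	// killWaitGrace is the extra second waited after Kill so Wait() still reaps the child.
	killWaitGrace = 1 * time.Second
)

// stopSubprocess asks an already-started child to stop gracefully; if it does not
// exit within collectorShutdownTimeout it kills the child, but in every path it
// waits for the child so the exit status is reaped and no zombie is left behind.
func stopSubprocess(ctx context.Context, cmd *exec.Cmd, requestStop func()) error {
	waitErr := make(chan error, 1)
	go func() { waitErr <- cmd.Wait() }()

	requestStop() // e.g. signal the collector to shut down

	select {
	case err := <-waitErr:
		return err // graceful exit, already reaped by Wait()
	case <-time.After(collectorShutdownTimeout):
		_ = cmd.Process.Kill() // graceful window elapsed, kill as a safety net
	case <-ctx.Done():
		_ = cmd.Process.Kill() // caller gave up, kill just in case
	}

	// Give the pending Wait() a short grace period so the killed child is reaped.
	select {
	case err := <-waitErr:
		return err
	case <-time.After(killWaitGrace):
		return fmt.Errorf("subprocess was killed but could not be reaped in time")
	}
}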

Why is it important?

Root Cause

When the Elastic Agent re-executes itself during restart, the following sequence occurs:

  1. If a subprocess (particularly the EDOT collector or command components) takes longer than the coordinator's 5-second shutdown timeout, the agent proceeds to execve itself
  2. During execve, all threads other than the calling thread are destroyed
  3. This triggers the PDeathSig mechanism we enable for subprocesses
  4. However, the parent process (the pre-execve Elastic Agent) never reaps (waits for) the exit status of the spawned subprocesses (see the sketch after this list)
  5. Result: these subprocesses end up as defunct/zombie processes
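
For context, a minimal Linux-only sketch (not the agent's actual code) of the PDeathSig setup and of why the parent still has to call Wait() to reap the child:

package main

import (
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("sleep", "60")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// Deliver SIGKILL to the child when the thread that spawned it dies,
		// which is exactly what happens to non-calling threads during execve.
		Pdeathsig: syscall.SIGKILL,
	}
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Simulate the forced stop: kill the child, then reap it. If the parent
	// never reaches Wait() (as with the pre-execve agent for slow-stopping
	// children), the dead child stays in the process table as <defunct>.
	_ = cmd.Process.Kill()
	_ = cmd.Wait() // reaps the exit status so no zombie is left behind
}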

Why This Affects EDOT More Than Beats

Beats subprocesses typically terminate almost immediately (within the 5-second window), so they don't become zombies. However, the EDOT collector's shutdown time seemed to be affected by:

  • Number of pipeline workers
  • Elasticsearch exporter configuration

Impact

  • Resource leaks: Zombie processes consume PIDs and kernel memory
  • Operational issues: Accumulation of zombies over multiple restarts
  • Config update delays: EDOT subprocess restarts on every config change, and 20+ second shutdowns create significant latency

This fix ensures proper process cleanup regardless of shutdown duration while maintaining graceful termination when possible.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

Users may notice:

  • Agent restarts can take longer (up to 35 seconds instead of 5 seconds in the worst case)
  • However, this ensures clean shutdowns and prevents zombie accumulation
  • The tradeoff is worthwhile as zombie processes can cause operational issues over time

How to test this PR locally

Run the TestMetricsMonitoringCorrectBinaries integration test.

Related issues

  • [beats receivers] Defunct elastic-agent otel --supervised process left behind when Elastic Agent re-executes itself

@pkoutsovasilis pkoutsovasilis self-assigned this Oct 17, 2025
@pkoutsovasilis pkoutsovasilis added the Team:Elastic-Agent-Control-Plane, skip-changelog, backport-8.19, backport-9.1, and backport-9.2 labels on Oct 17, 2025
@pkoutsovasilis pkoutsovasilis force-pushed the fix/cordinator_timeout branch 2 times, most recently from 6186951 to a32c16f on October 21, 2025 10:11
@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review October 21, 2025 10:32
@pkoutsovasilis pkoutsovasilis requested a review from a team as a code owner October 21, 2025 10:32
@elasticmachine (Collaborator)

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@swiatekm swiatekm (Contributor) left a comment

LGTM, some relatively minor nitpicks.

@swiatekm swiatekm self-requested a review October 21, 2025 14:33
@swiatekm swiatekm previously approved these changes Oct 21, 2025
@pkoutsovasilis pkoutsovasilis force-pushed the fix/cordinator_timeout branch 2 times, most recently from 2cae2b1 to b37db6c on October 23, 2025 07:42
@michalpristas michalpristas (Contributor) left a comment

small nits, but it looks good, will approve after green CI

cfg.Settings.Collector,
monitor.ComponentMonitoringConfig,
cfg.Settings.ProcessConfig.StopTimeout,
3*time.Second, // this needs to be shorter than 5 * time.Seconds (coordinator.managerShutdownTimeout) otherwise we might end up with defunct processes
Contributor

worth making it a const with a comment
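
A hypothetical shape for that suggestion (the const name is illustrative, not necessarily what the PR ended up using):

// collectorShutdownTimeout must stay below coordinator.managerShutdownTimeout (5s),
// otherwise re-exec can leave the supervised collector behind as a defunct process.
const collectorShutdownTimeout = 3 * time.Second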

Contributor Author

changed in f66af91

s.log.Warnf("timeout waiting (%s) for the supervised collector to stop, killing it", waitTime.String())
// our caller ctx is Done; kill the process just in case
_ = s.processInfo.Kill()
case <-time.After(1 * time.Second):
Contributor

So the worst case is waitTime + 1s. Please update the func docs.
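
A possible shape for such a doc comment (illustrative only; the function name is hypothetical):

// Stop shuts down the supervised collector gracefully. In the worst case it
// returns after waitTime plus one extra second spent waiting for the killed
// process to be reaped.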

Contributor Author

changed in f66af91

@elasticmachine (Collaborator)

💛 Build succeeded, but was flaky

Failed CI Steps

History

cc @pkoutsovasilis

Development

Successfully merging this pull request may close these issues.

[beats receivers] Defunct elastic-agent otel --supervised process left behind when Elastic Agent re-executes itself
